fix(timing): warmup pass before timing loop to amortise torch.compile JIT by ivanbasov · Pull Request #70 · NVIDIA/Ising-Decoding

ivanbasov · 2026-04-20T18:35:49Z

Summary

Adds a single warmup forward pass through pipeline_module before the timing loop in run_inference_and_decode_pre_decoder_memory
Triggers torch.compile lazy compilation so the JIT cost does not inflate the first-batch timing measurement
Guard: the warmup runs only when trt_context is None and _applied_compile is truthy (the torch-only path with compile enabled)
CUDA sync after the warmup pass on GPU devices
Warmup logic extracted into _maybe_warmup_compile helper with 5 unit tests
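
The guard, the single forward pass, and the sync described above could look roughly like the sketch below. This is not the merged code: the function name, its parameters, and the injected `synchronize` callable are assumptions made so the example runs without PyTorch. In the real helper the sync is presumably a call to torch.cuda.synchronize.

```python
from typing import Any, Callable, Optional


def maybe_warmup_compile(
    pipeline_module: Callable[[Any], Any],
    example_batch: Any,
    trt_context: Optional[Any],
    applied_compile: bool,
    device_type: str,
    synchronize: Callable[[], None],
) -> bool:
    """Run one untimed forward pass so lazy compilation happens before timing.

    Hypothetical signature for illustration only.
    Returns True if the warmup ran, False if it was skipped.
    """
    # Guard from the summary: torch-only path with compile enabled.
    if trt_context is not None or not applied_compile:
        return False
    # The output is discarded; the call exists only to trigger compilation.
    pipeline_module(example_batch)
    # GPU kernels launch asynchronously, so wait before the timing loop starts.
    if device_type == "cuda":
        synchronize()
    return True
```

Injecting the sync function is a choice made for this sketch; it keeps the guard logic testable on a machine without a GPU.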

Motivation

Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost, skewing Phase Timing numbers — especially at low sample counts (PREDECODER_INFERENCE_NUM_SAMPLES=1):

          Model forward (first batch)
Before    ~887 ms
After     ~1 ms

With large sample counts the JIT cost gets amortised naturally, but at small counts it dominates and makes Phase Timing numbers misleading. Proposed by Igor Almeida Baratta; approved by Ben Howe.
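
To make the amortisation argument concrete, here is a small arithmetic sketch. It assumes a simplified model in which only the first batch pays the ~887 ms compile cost and every later batch takes a constant ~1 ms; the function name and the batch counts are illustrative, not taken from the repository.

```python
def mean_forward_ms(num_batches: int, first_ms: float = 887.0, steady_ms: float = 1.0) -> float:
    """Average model-forward time per batch when only the first batch pays the JIT cost."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    # One slow first batch plus (num_batches - 1) steady-state batches.
    total_ms = first_ms + (num_batches - 1) * steady_ms
    return total_ms / num_batches


for n in (1, 10, 1000):
    print(f"{n:>5} batches: {mean_forward_ms(n):.3f} ms mean")
```

Under these assumptions the mean is 887 ms for a single batch, 89.6 ms for 10 batches, and about 1.886 ms for 1000 batches, which is why the distortion is most visible at small sample counts.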

Test plan

Existing unit tests pass (test_inference_latency_timing.py, test_tensorrt_fallback.py)
Run with PREDECODER_INFERENCE_NUM_SAMPLES=1, confirm first-batch model-forward time matches steady-state
Run with TRT enabled, confirm warmup block is skipped
CI green

🤖 Generated with Claude Code

… JIT

Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost (~887 ms vs ~1 ms steady-state), skewing Phase Timing numbers, especially at low sample counts like PREDECODER_INFERENCE_NUM_SAMPLES=1. The warmup only runs when torch.compile is active and TRT is not in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extracts the warmup block into a named helper so it can be tested in isolation. Five tests cover: fires when compile is active (CPU), skipped when compile is off, skipped when TRT context is present, CUDA sync called on GPU device, CUDA sync not called on CPU device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
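
The five cases listed in this commit message can be sketched with unittest.mock as follows. The tests in the PR are not shown on this page, so everything here is an assumption: `warmup_stand_in` is a hypothetical stand-in for the real helper, and the sync function is passed in as a mock rather than patched on torch.

```python
import unittest
from unittest.mock import MagicMock


def warmup_stand_in(module, batch, trt_context, applied_compile, device_type, sync):
    # Hypothetical stand-in for the PR's helper; the real signature may differ.
    if trt_context is None and applied_compile:
        module(batch)
        if device_type == "cuda":
            sync()


class WarmupGuardTests(unittest.TestCase):
    def run_case(self, trt_context, applied_compile, device_type):
        # Fresh mocks per case so call counts do not leak between tests.
        module, sync = MagicMock(), MagicMock()
        warmup_stand_in(module, "batch", trt_context, applied_compile, device_type, sync)
        return module, sync

    def test_fires_when_compile_active_on_cpu(self):
        module, _ = self.run_case(None, True, "cpu")
        module.assert_called_once_with("batch")

    def test_skipped_when_compile_off(self):
        module, _ = self.run_case(None, False, "cpu")
        module.assert_not_called()

    def test_skipped_when_trt_context_present(self):
        module, _ = self.run_case(object(), True, "cuda")
        module.assert_not_called()

    def test_sync_called_on_gpu(self):
        _, sync = self.run_case(None, True, "cuda")
        sync.assert_called_once_with()

    def test_sync_not_called_on_cpu(self):
        _, sync = self.run_case(None, True, "cpu")
        sync.assert_not_called()
```

Mocks stand in for both the model and the sync call, so all five cases run on a CPU-only machine.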

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

IgorBaratta

LGTM

ivanbasov and others added 3 commits April 20, 2026 11:34

style: yapf formatting on test_inference_latency_timing

2427b59

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ivanbasov marked this pull request as ready for review April 20, 2026 18:37

ivanbasov requested review from IgorBaratta, bmhowe23 and kvmto and removed request for kvmto April 20, 2026 18:37

IgorBaratta approved these changes Apr 22, 2026


bmhowe23 approved these changes Apr 22, 2026


ivanbasov merged commit c71b4e4 into main Apr 23, 2026
17 checks passed

ivanbasov deleted the fix/timing-compile-warmup branch April 23, 2026 15:03


ivanbasov merged 3 commits into main from fix/timing-compile-warmup
